
One Giant Leap: 95% Less Sampling Cost

Imagine your company runs a large-scale data processing system where every second counts, and you are constantly looking for ways to reduce latency and optimize performance. Now consider the impact of making user CPU time monitoring for threads 20 times faster. The standard way to read thread CPU time on Linux is the computational equivalent of printing a spreadsheet to paper, scanning it with a camera, and digitizing it back into a spreadsheet just to read a single value. It works, but it destroys throughput. I recently integrated a patch into OpenJDK that replaces this legacy logic with direct system calls. The result is a 20x speedup for monitoring user CPU time per thread. Here is the story behind the optimization, and why "everything is a file" isn't the right philosophy for performance.

1. Background

1.1. The Observer Effect

In physics, the observer effect dictates that the mere act of measuring a system inevitably alters its state: you cannot check a tyre’s pressure without letting a little air out. Computer science is bound by the same law: observability comes at a cost. If you want deep insights into thread performance, you must be willing to pay in CPU cycles. A single sample of CPU usage may cost around 500 CPU cycles, which seems negligible, but in a high-frequency sampling scenario these costs accumulate quickly and affect performance. While physics imposes a fundamental lower bound on this cost, implementation choices often set the upper bound much higher than necessary.

To understand performance, we cannot rely on wall-clock time alone. A thread waiting five seconds for a database response has a very different impact on CPU resources than a thread performing AI inference for the same period; the latter is substantially more expensive. CPU time captures this cost. We further distinguish between user CPU time and system CPU time: user time is CPU time spent purely on the application’s own operations, while system time is CPU time spent in the OS kernel.

For years, the OpenJDK implementation of ThreadMXBean.getCurrentThreadUserTime() on Linux was a textbook example of an implementation whose cost sat far above the physical minimum. It was, however, robust, correct, and strictly adherent to the Unix philosophy that “everything is a file”.

1.2. Reading Pseudo-Text Files Is More Expensive Than You Think

Historically, to get thread CPU time, the JVM read a file from the /proc directory, a collection of pseudo-text files maintained by the Linux kernel. /proc is excellent for shell commands, but subpar for high-frequency sampling. Every time the JVM checked a thread's user CPU time, it metaphorically printed a spreadsheet to paper, scanned it with a camera, and digitized it back into a spreadsheet just to read one value. In technical terms: because /proc is a pseudo-filesystem, file content is generated just-in-time; the JIT technique is not reserved for compilers! The kernel samples statistics from the internal structures it maintains, formats them into a text string, and copies it to user space. The JVM then parses that text back into an integer. Let us unpack what this really means.

When a Java program calls ThreadMXBean.getCurrentThreadUserTime(), the JVM begins by calling open("/proc/...", O_RDONLY):

sequenceDiagram
  autonumber
  participant JVM as JVM (user space)
  participant Kernel as Kernel
  participant Hardware as CPU/Memory

  JVM->>JVM: Allocate buffer
  JVM->>Kernel: 1. open("/proc/...", O_RDONLY)
  Kernel-->>JVM: File Descriptor

This step is cheap; it just returns a file descriptor (fd) for the pseudo-file that will contain text. The string content is generated only when someone calls read on that file descriptor, e.g. read(fd, buffer, 2047), which is exactly the next step the JVM takes to populate its own buffer.

sequenceDiagram
  autonumber 4
  participant JVM as JVM (user space)
  participant Kernel as Kernel
  participant Hardware as CPU/Memory

  JVM->>Kernel: 2. read(fd, buffer, 2047)

  rect rgba(19, 19, 19, 0)
  Note right of Kernel: Print to paper phase
  Kernel->>Kernel: Allocate buffer
  Kernel->>Hardware: Fetch statistics
  Hardware-->>Kernel: Building a ~340 character long string

  loop For each of ~52 statistics fields: convert to string<br>and append to buffer
      Kernel->>Kernel: Convert PID
      Kernel->>Kernel: Convert State
      Note right of Kernel: ...burning cycles on irrelevant data...
      Kernel->>Kernel: Convert user time (Bingo!)
      Note right of Kernel: ...formatting 30+ more fields...
  end
  end

  Kernel-->>JVM: Copy 2047 bytes to the JVM's buffer

A reader of the legacy OpenJDK implementation that my patch replaced might notice that it reads only the first 2047 bytes of the /proc file. While that is a neat optimization, it does not stop the kernel from populating its buffer to completion. Even if you want just the first byte of the file, the kernel still locks the thread for safety reasons and fills the entire buffer before returning anything. The perceived efficiency of a partial read is therefore undermined: you still pay for lock contention and the full kernel path. During high-frequency reads these operations can cost critical microseconds, amplifying the observer effect and impacting overall system performance.

After the data crosses into user space, the JVM must iterate through the buffer, skipping irrelevant tokens until it reaches the user time field. Only then is the substring parsed and converted back into a primitive integer.

sequenceDiagram
  autonumber 12
  participant JVM as JVM (user space)
  participant Kernel as Kernel
  participant Hardware as CPU/Memory

  rect rgba(230, 230, 255, 0)
  Note right of JVM: The digitization phase
  JVM->>JVM: String parsing
  end

  JVM->>Kernel: 3. close(fd)
  Kernel-->>JVM: Success

Beyond the runtime cost, this approach introduces complexity. As one reviewer for my patch recalled, the fragility of parsing text from the kernel has been a source of bugs in the past.

1.2.1. Memory Overheads

The largest possible value for user CPU time requires just 20 bytes, although you are unlikely ever to observe such a large value. On my machine, the /proc file payload averages around 340 bytes; the exact size varies because the width of each formatted number depends on the thread's accumulated statistics.

This means we are forced to wrap a 20-byte message in 340 bytes. This packaging is expensive: the kernel serializes raw integers into ASCII text, copies the bulk data to the JVM, and the JVM must then search and parse the result back into an integer. Only then do we extract the value and release the buffer.

But the memory inefficiency runs deeper. Since the kernel generates this content on the fly, it does not know the final size upfront. Rather than attempting a precise allocation, it defaults to a standard 4KB page for the buffer; see fs/seq_file.c.

Reading the source code is good, but watching it run is more fun. I wrote a reproduction case to catch it in the act: a small C program that triggers the read, plus an eBPF probe that traces the kernel allocator. With eBPF you can run small programs inside the kernel without recompiling it, enabling precise observability that would otherwise be hard to achieve.

Sample program that reads /proc/self/task/pid/stat
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/syscall.h>

int main() {
    char path[64];
    char buffer[2] = {0};
    sprintf(path, "/proc/self/task/%zd/stat", syscall(SYS_gettid));
    int fd = open(path, O_RDONLY);

    // Read the first byte of the file into buffer
    int read_bytes = read(fd, buffer, 1); // Triggers an internal kmalloc of a 4KB page
    printf("%d bytes contained: %s\n", read_bytes, buffer);
    close(fd); // Frees the 4KB page

    return 0;
}

Compile with: gcc read_proc.c -o read_proc

eBPF to probe kernel allocations caused by read()
// Trace the allocation caused by call to read().
// Reads on a file like /proc/self/task/%zd/stat will
// cascade down to seq_read_iter
kprobe:seq_read_iter /comm == "read_proc"/ {
    @read_triggered = 1;
}
tracepoint:kmem:kmalloc /comm == "read_proc" && @read_triggered/ {
    @pointer_to_page = args->ptr;
    printf("\n[read(...)] bytes requested: %d, allocated: %d\n", args->bytes_req, args->bytes_alloc);
    kstack();
}
kretprobe:seq_read_iter /comm == "read_proc"/ {
    delete(@read_triggered);
}

// Trace the deallocation caused by call to close()
tracepoint:syscalls:sys_enter_close /comm == "read_proc"/ {
    @close_triggered = 1;
}
kprobe:kfree, kprobe:kvfree /comm == "read_proc"/ {
    if (arg0 == @pointer_to_page && @close_triggered) {
        printf("[close(...)] Page was freed.\n");
        kstack();
        delete(@pointer_to_page);
    }
}
tracepoint:syscalls:sys_exit_close /comm == "read_proc"/ {
    delete(@close_triggered);
}

Run as sudo bpftrace trace.bt

Running this probe confirms the hypothesis: the kernel allocates a full 4KB page for the internal buffer.

Attaching 7 probes...

[read(...)] bytes requested: 4096, allocated: 4096
[close(...)] Page was freed.

This confirms that the actual memory overhead is even higher than it appears. We aren’t just wasting the unused characters in the text string. We are paying for a full memory page. Over 99% of the allocated space is waste.

1.3. Observations at Scale

The computational overhead became a bottleneck for high-throughput systems like PrestoDB. As detailed in JDK-8210452, the maintainers of Presto—a distributed SQL engine processing massive datasets—found that the resource cost of monitoring threads was unacceptably high, forcing them to adopt a “workaround” that essentially stopped performing the measurement, i.e., stopped calling ThreadMXBean.getCurrentThreadUserTime().

2. A Walk Through History

UNIX is an operating system developed by Bell Labs in 1969. While its foundation became the industry standard, each manufacturer would buy a license to the source code and use it to create its own flavour. The differences prevented users from moving an application from a system made by, e.g., IBM, to one made by HP. This led to the creation of the POSIX standard. With this, users could finally move their applications between any computer, regardless of whether IBM or HP was the vendor behind the UNIX implementation.

In 1991, a young Finnish guy named Linus Torvalds decided he wanted to run UNIX on his new home computer, but could not afford an official UNIX license. Being a computer science student, he also recognized that he had the knowledge to recreate a UNIX-like operating system himself, which is what he did. This operating system is what we today call Linux. It treated POSIX as a guideline, but since it never sought an expensive official POSIX certification, it could also deviate from it.

LinuxThreads (LT) was the library that implemented thread support. However, LT was not POSIX compliant and had fundamental design issues that caused severe problems with the signal system (more details here, page 3). The Native POSIX Thread Library (NPTL) was the solution that replaced it and, as the name implies, it is POSIX compatible.

With this in mind, the decision to read /proc was not a bad one. On the contrary, it was the only available option. The ThreadMXBean.getCurrentThreadUserTime() method was introduced in Java 1.5, released on September 30, 2004, and development on it likely began well before that date. The POSIX standard only specifies that the total CPU time should be available, not its components (user/system time).

NPTL was integrated with the release of Linux 2.6, which—to my knowledge—did not initially support any functions to read CPU time. It did, however, introduce support for reading thread CPU time via /proc (see /proc/[pid]/task). On June 17, 2005, with the release of Linux 2.6.12, function-based support for querying both total and user CPU time was finally added. Technically, this distinction is encoded in the last three bits of the clock ID.

The only problem is that this mechanism is undocumented; discovering it requires reading the source code. This is significant because Linux user APIs are designed to never break. So, Jonas, isn’t this just an internal detail the kernel developers are trying to hide? I don’t think so. An attempt to explicitly define what officially constitutes the user API wasn’t made until December 2012. Glibc has relied on the fact that the last three bits of a clock ID define the CPU time data for over 20 years. This makes that part of the kernel a de facto user API. Additionally, over all these years, glibc has even documented them as the kernel ABI; see here. Given how tightly the kernel and glibc depend on each other in most deployed systems worldwide, the kernel developers cannot change this detail without breaking most of the world. Therefore, it is safe to use.

3. Implementation

Reads from /proc and subsequent string parsing are replaced with code that changes the clock ID encoding to retrieve user CPU time instead of total CPU time.

constexpr clockid_t CLOCK_TYPE_MASK = 3; // Mask for the clock-type bits
constexpr clockid_t CPUCLOCK_VIRT = 1;   // User CPU time

clockid_t clock_id;
pthread_getcpuclockid(pthread_self(), &clock_id);
clock_id = (clock_id & ~CLOCK_TYPE_MASK) | CPUCLOCK_VIRT; // Clear type bits, select user CPU time

struct timespec tp;
clock_gettime(clock_id, &tp);
return (tp.tv_sec * NANOSECS_PER_SEC) + tp.tv_nsec;

4. Performance Impact

4.1. C++ Microbenchmark

I wrote a C++ microbenchmark that measures the time it takes to query user CPU time, comparing the legacy method that reads /proc with my new implementation that calls clock_gettime. Whether we examine the worst case (max) or best case (min) of each method, the legacy method is comparatively much slower. "Idle" measures an otherwise quiet machine, while "stressed" spawns a number of threads that repeatedly call mmap (allocate memory) and munmap (deallocate memory); these calls contend for shared resources that reading /proc also needs. The worst case (100th percentile) for the legacy method is 1.8x slower and takes almost 3.6 ms, and the median (50th percentile) is 7x slower, still above 2 ms compared to 0.3 ms for clock_gettime. From an observability perspective, the worst cases are the most interesting: it is when the system is under heavy load that we really want to understand the costs. If sampling a method's user CPU time costs 2-4 ms, the observer effect is so severe that your results can be highly unreliable whenever the method itself only executes for a couple of milliseconds. While clock_gettime isn't perfect, it is a substantial improvement, making fine-grained measurement of user CPU time far more feasible.

4.2. Java Microbenchmark

To ensure that the implementation was correct and to confirm the observations in the C++ microbenchmark still hold, I also created and integrated a Java benchmark to test the new code.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
@Warmup(iterations = 2, time = 5)
@Measurement(iterations = 5, time = 5)
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Threads(1)
@Fork(value = 10)
public class ThreadMXBeanBench {
    static final ThreadMXBean mxThreadBean = ManagementFactory.getThreadMXBean();
    static long user; // To avoid dead-code elimination

    @Benchmark
    public void getCurrentThreadUserTime() throws Throwable {
        user = mxThreadBean.getCurrentThreadUserTime();
    }
}

This benchmark is similar to the idle situation in the C++ benchmark: we do nothing except query user CPU time. Additionally, I ran the benchmark on a different machine/CPU to confirm the robustness of the performance observation. The results are similar, but notably the worst case was even worse for the legacy code (/proc): a full 1 ms was needed to sample user CPU time! The reduction for the 100th percentile here is 95% (and the inspiration for the blog title).

Legacy code (/proc):
Benchmark                  Mode      Cnt  Score    Error  Units
CPUTime.execute          sample  7506555  0.008 ±  0.001  ms/op
CPUTime.execute:p0.00    sample           0.008           ms/op
CPUTime.execute:p0.50    sample           0.008           ms/op
CPUTime.execute:p0.90    sample           0.008           ms/op
CPUTime.execute:p0.95    sample           0.008           ms/op
CPUTime.execute:p0.99    sample           0.012           ms/op
CPUTime.execute:p0.999   sample           0.015           ms/op
CPUTime.execute:p0.9999  sample           0.021           ms/op
CPUTime.execute:p1.00    sample           1.030           ms/op

New code (clock_gettime):
Benchmark                  Mode      Cnt   Score    Error  Units
CPUTime.execute          sample  8984189  ≈ 10⁻³           ms/op
CPUTime.execute:p0.00    sample           ≈ 10⁻³           ms/op
CPUTime.execute:p0.50    sample           ≈ 10⁻³           ms/op
CPUTime.execute:p0.90    sample           ≈ 10⁻³           ms/op
CPUTime.execute:p0.95    sample           ≈ 10⁻³           ms/op
CPUTime.execute:p0.99    sample            0.001           ms/op
CPUTime.execute:p0.999   sample            0.001           ms/op
CPUTime.execute:p0.9999  sample            0.006           ms/op
CPUTime.execute:p1.00    sample            0.054           ms/op

5. Conclusion

I have demonstrated that Linux offers a significantly faster API for querying user CPU time—one that remains largely overlooked because it is poorly documented. I hope this blog post can help spread awareness. More importantly, any Java developer who upgrades to JDK 26 will be able to enjoy blazingly fast access to user CPU time.

As we look to the future, the question remains: what other “everything-is-a-file” costs linger untamed in the technological stack? Identifying these inefficiencies could pave the way for the next substantial performance enhancement. Let’s challenge ourselves to uncover the next opportunity for a 95% saving 😎.

JDK-8372584 and PR.

The views expressed in this blog are my own and do not necessarily reflect the views of Oracle.